In the modern age, several new methods have been developed, especially in
image processing for the agriculture business, based on the branch of
artificial intelligence (AI) called machine learning.
Classification is a widely used method for analyzing patterns, trends, and the
body of knowledge drawn from data visualization. Image classification
improves discrimination and prediction efficiency. The objective
of this research was to extract features of sweet tamarind and compare
classification algorithms. This research used images of the golden sweet
tamarind species, processed with MATLAB and Python. The research
consisted of three steps: 1) preprocessing, to find the camera distance that
gives appropriate image quality; 2) feature extraction, to find the number
of black pixels, the number of white pixels, the perimeter, the diameter, and
the centroid; and 3) classification, to compare algorithms. The results showed
that the best camera-to-object distance was 60 cm, where the coefficient of
determination was 0.9956 and the standard error of estimate was
7,424.736 pixels. In classification, random
forest had the highest accuracy at 92.00% (SD = 8.06), with precision = 90.12,
recall = 92.86, and F1-score = 91.36.
1. INTRODUCTION


Artificial intelligence (AI) has played numerous important roles in the development of
organizational functions. It has been implemented through technologies [1] such as image processing [2], the internet of
things (IoT) [3], intelligent control [4] and robotics, signal processing, natural language processing (NLP)
[5], and big data analytics [6]. The integration of AI and image processing technology in the agriculture
sector [7] promotes the development of innovations, tools, and methodologies that
make smart farming more effective.

A smart agriculture system improves efficiency and increases productivity per area by applying technology
and artificial intelligence to agriculture. The intelligent system covers seed selection, soil quality,
and the monitoring and regulation of light and temperature. AI and IoT systems can determine the proper amount of
nutrients and water and support planting management and pest control. An innovative agriculture system enables
accurate production forecasts while reducing costs and the use of human labor.

In 2020, the smart agriculture technology market was worth more than $26 billion, with continued compound
annual growth (CAGR). More than one-third of that market value was in the Asia-Pacific region.
The value of the smart agriculture market in Thailand was $128.7 million in 2018 and was expected to reach
$269.9 million in 2022. Intelligent algorithms and tools such as machine learning [8] allow
automated machines to learn and to apply the trained algorithms to solve problems for humans.

This research classifies images by their distinctive features, an approach used in various
applications. Image classification algorithms learn from labeled data and then classify new data. The algorithms
used here were k-nearest neighbors (KNN) [9], decision tree [10], random forest [11], support vector machine
(SVM) [12], and logistic regression [13]. Selecting the most appropriate
classification algorithm for the distinctive features extracted from the images in the dataset is crucial.
Each model's performance was measured by accuracy, precision, recall, and F1-score.
The data were divided into a training set and a test set by k-fold cross validation (K-fold),
stratified k-fold (SK-fold), and leave one out cross validation (LOOCV). In this study, 5-fold cross validation
was implemented in Python to classify the data and to obtain reliable measurements of how well the
extracted key features of the sweet tamarind support classification.

The results were then used to develop a standardized size sorting tool for sweet tamarind. The
results can also be applied to sorting sweet tamarind by packing size for export. The abovementioned
technologies have been used in the agricultural sector to increase production efficiency, reduce
production cost, and save time and labor. Moreover, production quality can be precisely controlled.


2. THEORETICAL BACKGROUND AND RELATED RESEARCH 

2.1. Image processing 

An image consists of small units called picture elements, or pixels, each of which has a numerical value.
Four general image representations are used in image processing:

— Original image: an image formed by combining the three primary colors, in which each pixel is a vector of
red, green, and blue values [14].

— Grayscale image: an image in shades of gray. In 8-bit form there are 256 possible gray levels, ranging
from black (0) to white (255) [15].

— Binary image: an image consisting only of black and white. Each pixel takes one of two values:
black as 0 and white as 1 or 255 [16].

— Histogram: a graph of the number of pixels at each intensity level. The horizontal axis shows the
intensity level from 0 to 255. Pixels with low intensity values appear
black, and pixels with high intensity values appear white. The
vertical axis shows the number of pixels at each intensity level.

Feature extraction for image analysis [17] identifies the main characteristics of an image in order to find
representative data that can stand in for that image [18]. In image analysis, features
are extracted from objects to find specific characteristics; here they are used to analyze the
image of sweet tamarind, which makes it possible to determine the sweet tamarind standard size. Methods for
representing images by specific characteristics commonly use the properties of color, texture,
shape, and histograms of oriented gradients (HOG) [19], [20].


2.2. Classification algorithm 

Shape classification works on either the region (recognition of an object's area) or the boundary (recognition
of an object's outline, also known as its pattern) [21]. There are two types of shape
classification methods: decision theoretic and structural [22]. The decision theoretic type recognizes
quantities such as length, area, and texture [23]. The structural type uses a pattern, which works best for
classification. In this research, the classification algorithms are KNN, decision tree, random forest, support
vector machine (SVM), and logistic regression [13].


2.3. Validation of method verification 

Validation tests the ability of a model that will later be used to predict
the class of new data. It does so by measuring the percentage of
predictions that are correct. In this research, three validation methods are used: k-fold cross
validation (K-fold), stratified k-fold (SK-fold), and leave one out cross validation (LOOCV).

2.3.1. K-fold cross validation (K-fold), stratified K-fold (SK-fold)

K-fold CV divides the dataset into k parts [11]. One part is held out to test the fitted model and is
referred to as the test set. The remaining k-1 parts are used to build the classification model and are
called the training set. The held-out part is rotated until every part has served once as the test set, and
the percentage of correct predictions is then computed. Stratified k-fold follows the same procedure but keeps
the class proportions of the full dataset in every fold. In most cases the training set makes up about
three-fourths of the total data and the remainder is the test set. K-fold CV is considered one of the most
commonly used methods for evaluating predictions on data to be classified.
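
The fold rotation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this research: the function name `stratified_kfold`, the round-robin dealing, and the seed are assumptions, and only the class sizes (80 grade A, 70 grade B) and k = 5 are taken from the paper.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, k, seed=0):
    """Return k test folds (lists of indices) with class ratios preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets an equal share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Grade labels matching the sample sizes reported in Section 3.
labels = ["A"] * 80 + ["B"] * 70
folds = stratified_kfold(labels, k=5)

for test_indices in folds:
    held_out = set(test_indices)
    # Everything outside the held-out fold forms the training set.
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    test_counts = Counter(labels[i] for i in test_indices)
    print(len(train_indices), len(test_indices), dict(test_counts))
```

With these sizes every test fold holds 16 grade A and 14 grade B samples, so each model is trained on 120 images and tested on 30. Plain k-fold would shuffle all 150 indices together and could leave a fold with an uneven grade mix.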


2.3.2. Leave one out cross validation (LOOCV) 

LOOCV is another method as widely accepted as k-fold. In summary, given a total of N observations, one
is held out and the remaining N-1 are used to build
the classification model [13]. The held-out observation is then used for testing, and the calculation is repeated in this
manner until every observation has been used once for testing [14]. After that, the percentage of correct predictions
is computed.
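
As a rough illustration of the leave-one-out loop, the sketch below holds out each observation in turn and classifies it with a one-nearest-neighbor rule on a single feature. The classifier choice, the function names, and the pixel areas are hypothetical and are not taken from this study.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Predict the label of the training point closest to `query` (1-NN)."""
    distances = [abs(x - query) for x in train_x]
    return train_y[distances.index(min(distances))]


def loocv_accuracy(xs, ys):
    """Hold out each observation once and return the share predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        # Train on the N-1 remaining observations.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical areas in pixels; not the measurements from this study.
areas = [52000, 50500, 49800, 48900, 47500, 41000, 40200, 39500, 38800, 47000]
grades = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
print(loocv_accuracy(areas, grades))
```

In this toy data the two observations near the grade boundary (47,500 and 47,000 pixels) are each other's nearest neighbor and are both misclassified, so 8 of 10 held-out predictions are correct.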


2.4. Sample 

Sweet tamarind is a popular fruit consumed in many countries across the world, including
Thailand, and it can be stored for a long time. In 2019, the volume of sweet tamarind
exported was as high as 19,902.6 tons [24], with an export value of 440.07 million baht. Thailand's major
trading partners are China, the United Arab Emirates, Vietnam, the United States of America, Malaysia,
Singapore, Indonesia, Saudi Arabia, Laos, and Kuwait. Most producers export around 70% of their output and sell around
30% domestically. The countries that order the most sweet tamarind are China [25] and Vietnam, but they
do not require high quality tamarind and accept grade-B products, whereas the United States of
America, European countries, and the Middle East order only high quality, or grade-A, products.

This research investigates a feature extraction method that provides correct values for the
dataset [26], [27]. It improves on previous research by enhancing the classification
techniques and the extraction of object texture features. It does so by combining image processing with
the data classification approaches of [9], [28]-[30] and the feature extraction method to create
criteria for sweet tamarind quality classification from images. Calculating the area of an
image in pixels, and comparing the performance of classification algorithms on features whose extraction
intervals are defined using the sum and variance of the mean number of pixels, can be used to sort sweet
tamarind by grade and quality.


3. RESEARCH METHOD 

A total of 150 pieces (images) of high quality golden sweet tamarind were used. The samples were divided
into two groups by size: 80 of standard size A and 70 of standard size B. Size A is of higher quality
than size B: 1) size A is bigger than size B, and 2) size A has a larger diameter than size B. The
images were captured in a prepared area with a black background at a
resolution of 10 million pixels. The research steps were preprocessing, feature
extraction, and evaluation, as shown in Figure 1.

Figure 1. Framework of research method

3.1. Step 1: preprocessing

The distance between the camera and the object was fixed to control the image quality of the
tamarind so that the system's measurements are read from the best image. Calibration
at different distances was therefore needed. Five distances from the reference object to the camera were tested:
20 cm, 30 cm, 40 cm, 50 cm, and 60 cm. The test objects were paper squares of 7×7 cm (49 cm²),
6×6 cm (36 cm²), 5×5 cm (25 cm²), and 4×4 cm (16 cm²), used to determine the linear trend between pixel count
and the standard area, and its R², by regression analysis [13]. The regression model is given in (1).


$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, \ldots, n \tag{1}$$
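
For the single-predictor case used in the calibration (pixel count against paper area), the fit can be sketched as below. This is an illustrative least-squares implementation, not the authors' code: the function name `fit_line` and the pixel counts are hypothetical, and the standard error of estimate is assumed to use n - 2 degrees of freedom, which the paper does not state.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit y = b0 + b1*x; returns (b1, b0, r_squared, see)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    # Standard error of estimate, assuming n - 2 degrees of freedom.
    see = math.sqrt(ss_res / (n - 2))
    return slope, intercept, r_squared, see


# Areas of the four paper targets in cm^2.
areas_cm2 = [16.0, 25.0, 36.0, 49.0]
# Hypothetical pixel counts for illustration only, not the study's data.
pixel_counts = [108000.0, 166000.0, 236000.0, 318000.0]

slope, intercept, r_squared, see = fit_line(areas_cm2, pixel_counts)
print(f"y = {slope:.1f}x + {intercept:.1f}, R^2 = {r_squared:.4f}, SEE = {see:.1f}")
```

Repeating the fit once per camera distance and comparing R² and the standard error is one way to arrive at a comparison like the one the paper reports.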


3.2. Step 2: feature extraction 

Feature extraction began by applying an unsharp mask to filter the image before
processing, which sharpens the image of the sweet tamarind. The numbers of black
pixels and white pixels in the image were then calculated. Threshold values were obtained from the intensity
histogram of the pixels of interest in order to separate the object from the background, and the total pixel count of
the image was identified. The number of white pixels, number of black pixels, perimeter, centroid, and diameter of
the object were calculated from the image using MATLAB.
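
The thresholding and pixel counting described above can be illustrated with a small Python sketch. It is not the MATLAB code used in this research: the 4×5 grayscale patch is hypothetical, the object is assumed to be brighter than the black background, and the threshold of 84 is the gray level shown in Figure 3.

```python
def to_binary(gray, threshold):
    """Map pixels brighter than `threshold` to 1 (white), the rest to 0 (black)."""
    return [[1 if value > threshold else 0 for value in row] for row in gray]


def histogram(gray, levels=256):
    """Count how many pixels fall on each intensity level."""
    counts = [0] * levels
    for row in gray:
        for value in row:
            counts[value] += 1
    return counts


# Hypothetical 4x5 grayscale patch: dark background, brighter object.
gray = [
    [10, 12, 11, 9, 10],
    [11, 150, 160, 155, 12],
    [10, 148, 170, 152, 11],
    [9, 10, 12, 11, 10],
]

counts = histogram(gray)
binary = to_binary(gray, threshold=84)
white_pixels = sum(sum(row) for row in binary)
black_pixels = len(gray) * len(gray[0]) - white_pixels
print(white_pixels, black_pixels)
```

Here the six bright pixels become white and the remaining fourteen become black, so the white pixel count gives the object size in pixels.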

— Step 2.1: Acquire the color (RGB) image of the sweet tamarind and enhance it with the unsharp mask
method. The color image is then converted into a grayscale image that holds the intensity at each
pixel position. The value at each (X, Y) coordinate is read and adjusted, and the resulting intensity
values are written back to the same coordinates. Once all pixel values have been obtained, the intensity
values are plotted as a histogram that shows the brightness levels of the sweet tamarind image, with the
images captured at the appropriate distance. The histogram is used to separate the object from the
background: the frequencies in the histogram are used to count the pixels whose values fall below the
cut-off (threshold) level, the total number of pixels of the object is counted, and the average intensity
level of the object is then calculated by the threshold method. Lastly,
the resulting grayscale image is converted into a binary image with two values: black (0) and white (1 or 255).
The resulting sweet tamarind image is thus separated into two pixel groups, the foreground
and the background.

— Step 2.2: Image segmentation separates the object from the background to reduce complexity and
to make the displayed boundary easier to analyze. A boundary detection method was used to
divide the image into regions by examining each group of object pixels to find the edges of the
sweet tamarind. The following MATLAB code was used:


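% Trace the outline of each object in the binary image.
boundaries = bwboundaries(binaryImage);
numberOfBoundaries = size(boundaries, 1);
hold on;  % keep the displayed image while the boundaries are overlaid
for k = 1 : numberOfBoundaries
    thisBoundary = boundaries{k};  % N-by-2 array of (row, column) points
    plot(thisBoundary(:,2), thisBoundary(:,1), 'g', 'LineWidth', 2);
end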


— Step 2.3: Store the distinctive characteristics of each sweet tamarind image in the database. Table 1
describes the attributes of the image data stored.


Table 1. Attributes of image data in database

No   Name of data   Type of data
1    Area           Double
2    Black pixel    Double
3    White pixel    Double
4    Perimeter      Double
5    Centroid       Double
6    Diameter       Double
7    Grade          Varchar
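
One record with the Table 1 attributes could be assembled from a binary image roughly as follows. This is a sketch under stated assumptions rather than the MATLAB procedure used here: area is taken to be the white (foreground) pixel count, perimeter is approximated by counting boundary pixels, and diameter is taken as the diameter of a circle with the same area. The function name and the test image are hypothetical.

```python
import math


def extract_features(binary, grade):
    """Build one record with the Table 1 attributes from a 0/1 image."""
    rows, cols = len(binary), len(binary[0])
    foreground = [(r, c) for r in range(rows) for c in range(cols) if binary[r][c]]
    area = len(foreground)

    def is_background(r, c):
        # Pixels outside the image count as background.
        return not (0 <= r < rows and 0 <= c < cols) or binary[r][c] == 0

    # Perimeter approximated as the number of foreground pixels that touch
    # the background through one of their four direct neighbours.
    perimeter = sum(
        1
        for r, c in foreground
        if any(is_background(r + dr, c + dc)
               for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)))
    )
    centroid = (
        sum(r for r, _ in foreground) / area,
        sum(c for _, c in foreground) / area,
    )
    return {
        "area": float(area),
        "black_pixel": float(rows * cols - area),
        "white_pixel": float(area),
        "perimeter": float(perimeter),
        "centroid": centroid,  # (row, column)
        # Diameter of a circle with the same area (an assumed definition).
        "diameter": math.sqrt(4.0 * area / math.pi),
        "grade": grade,
    }


# Hypothetical 5x6 binary image containing a 3x4 white rectangle.
binary = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
record = extract_features(binary, grade="A")
print(record)
```

Each image then contributes one such row, with the grade label serving as the target for the classifiers in step 3.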


3.3. Step 3: classification 

In this step, the extraction of dataset from tamarind image from step 2 was identified and the data 
was divided into two parts: training set and test set by k-fold cross validation, stratified k-fold, and loocv 
methods. 5-fold cross validation was used with python language for classify data. After that, four benchmarks 
were then used to measure the effectiveness of the tests: accuracy, precision, recall, and F1-score. 
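
The four measures can be computed from the predictions on one test fold as sketched below. The function name, the choice of grade A as the positive class, and the example labels are assumptions made for illustration; the paper does not state which grade was treated as positive.

```python
def classification_metrics(y_true, y_pred, positive="A"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical grades for one test fold, not results from the study.
y_true = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]
y_pred = ["A", "A", "A", "B", "A", "B", "B", "B", "B", "B"]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this example one grade A sample is missed and one grade B sample is wrongly called grade A, giving an accuracy of 0.8 and precision, recall, and F1 of 0.75. Averaging the per-fold values over the five folds gives figures comparable in form to those reported in the results.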

4. RESULTS
4.1. Result of camera distance measurement on image quality of sweet tamarinds 

Table 2 shows the results of the experiment to determine the camera-to-object distance that
gives the best image quality for sweet tamarind. The camera was set at five distances (20 cm, 30
cm, 40 cm, 50 cm, and 60 cm) and the object areas were 49 cm², 36 cm², 25 cm², and 16 cm². For each
distance, a linear equation was fitted with the area of the object on the x-axis and the number of pixels on
the y-axis. Figure 2 shows the fit for a camera distance of 60 cm. The 60 cm distance gave the strongest linear
relationship between pixel count and object area, with an R² of 0.9956, and the smallest standard error of
estimate, 7,424.736 pixels. The images in this research were therefore
photographed at 60 cm.


Table 2. Results of linear equation for length between camera and object

Length (cm)   Linear equation          Coefficient of determination (R²)   Standard error of estimate (pixels)
60            y = 6366.4x + 6269       0.9956                              7,424.736
50            y = 9010.3x + 304.14     0.9574                              33,180.84
40            y = 14071x - 20927       0.9829                              32,410.54
30            y = 22949x - 21099       0.9952                              27,818.88
20            y = 48104x - 126554      0.9900                              84,237.99


[Figure 2 plot: camera distance of 60 cm; x-axis: area of the object (cm²), 0 to 60; y-axis: number of pixels; fitted line y = 6366.4x + 6269, R² = 0.9956]


Figure 2. Results from image processing
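
As a worked use of the 60 cm equation in Table 2, the fitted line can be applied in both directions: from a known area to an expected pixel count, and from a measured pixel count back to an area in cm². The sketch below only rearranges the reported equation; the function and constant names are hypothetical.

```python
SLOPE_60CM = 6366.4      # pixels per cm^2, from Table 2
INTERCEPT_60CM = 6269.0  # pixels, from Table 2


def pixels_from_area(area_cm2):
    """Predicted pixel count for an object of the given area at 60 cm."""
    return SLOPE_60CM * area_cm2 + INTERCEPT_60CM


def area_from_pixels(pixel_count):
    """Invert the fitted line to estimate area in cm^2 from a pixel count."""
    return (pixel_count - INTERCEPT_60CM) / SLOPE_60CM


# Predicted pixel counts for the four calibration squares.
for area in (16, 25, 36, 49):
    print(area, round(pixels_from_area(area), 1))
```

The inverse form is what a sorting tool would need, since the image supplies a pixel count and the grade criteria are more naturally expressed in physical units.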


4.2. Results from the feature extraction

Figure 3 shows the results obtained from image
processing in MATLAB: Figure 3(a), the original image, produced by converting the RGB image of sweet
tamarind to a grayscale image with the unsharp mask technique applied; Figure 3(b), the histogram generated
from the grayscale image, used to determine the threshold gray level; Figure 3(c), the binary image, obtained by
converting the grayscale image into black and white; and Figure 3(d), the boundaries image, which
determines the edges of the object. The boundaries are then used to find the black pixels and white pixels and to
calculate the area of the image in pixels. The number of
black pixels, number of white pixels, perimeter, diameter, and centroid were calculated in
MATLAB.


[Figure 3 panels: (a) original image; (b) histogram, intensity axis 0 to 250, background and foreground separated, thresholded at 84 gray levels; (c) binary image; (d) boundaries image]


Figure 3. Results from the feature extraction steps: (a) original image, (b) histogram image,
(c) binary image, and (d) boundaries image

4.3. Algorithm benchmark results for classification

The data were divided into a training set and a test set using k-fold cross-validation,
stratified k-fold, and LOOCV. In this study, 5-fold cross-validation was used. Four
measures were used to evaluate the tests: accuracy, precision, recall, and F1-score.
Random forest had the highest accuracy. The cross-validation
scores used to confirm this are shown in Tables 3 and 4. With stratified 5-fold cross-validation, random
forest reached an accuracy of 92.00% (SD = 8.06), precision = 90.12, recall = 92.86, and F1-score = 91.36.
The principle of random forest is to train many instances of the same kind of model on the same
data set, with each training run selecting a different part of the data. The
decisions of those models are then put to a vote, and the class chosen most often is the prediction. On this
dataset, random forest had the highest accuracy, precision, and F1-score of the algorithms compared, although
SVM and KNN had higher recall (94.29).
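
The train-on-different-parts-then-vote principle can be shown with a deliberately simplified sketch. It uses single-threshold classifiers on one feature and bootstrap resampling, so it illustrates bagging with majority voting rather than a full random forest, which grows complete decision trees and also samples features at each split. All names and the pixel areas are hypothetical.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """Pick the threshold on a single feature that best separates A from B."""
    best_threshold, best_correct = xs[0], -1
    for threshold in xs:
        correct = sum(
            1 for x, y in zip(xs, ys)
            if ("A" if x >= threshold else "B") == y
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def train_ensemble(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        picks = [rng.randrange(len(xs)) for _ in xs]
        thresholds.append(train_stump([xs[i] for i in picks],
                                      [ys[i] for i in picks]))
    return thresholds


def predict(thresholds, x):
    """Let every stump vote and return the most common grade."""
    votes = Counter("A" if x >= t else "B" for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical areas in pixels; larger pieces are grade A.
areas = [52000, 50500, 49800, 48900, 47500, 41000, 40200, 39500, 38800, 37000]
grades = ["A"] * 5 + ["B"] * 5
model = train_ensemble(areas, grades)
print(predict(model, 51000), predict(model, 38000))
```

Because each stump sees a different resample, individual thresholds vary, but the vote across all of them is more stable than any single one.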


Table 3. Results of accuracy and standard deviation from 5-fold cross validation

Algorithm                      K-Fold             Stratified K-Fold   LOOCV
                               Accuracy   SD      Accuracy   SD       Accuracy   SD
Decision tree                  89.33      4.42    88.67      11.47    85.00      35.71
Support Vector Machine (SVM)   87.33      9.98    90.00      5.58     85.00      35.71
K-Nearest Neighbors (KNN)      89.33      12.18   91.33      4.52     89.00      31.29
Logistic regression            84.67      9.57    90.00      5.16     86.00      34.70
Random forest                  91.33      3.40    92.00      8.06     90.00      30.00
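
The large LOOCV standard deviations in Table 3 are consistent with how leave-one-out scoring works: each fold tests a single observation, so its score is either 0 or 1, and the standard deviation of such scores depends only on the mean accuracy p, as sqrt(p(1 - p)). The short check below assumes the population form of the standard deviation; the function name is hypothetical.

```python
import math


def loocv_sd(accuracy_percent):
    """SD (in percent) of per-fold scores when every fold scores 0 or 1."""
    p = accuracy_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p))


# LOOCV accuracies reported in Table 3.
for name, accuracy in [("Decision tree", 85.0), ("KNN", 89.0),
                       ("Logistic regression", 86.0), ("Random forest", 90.0)]:
    print(name, round(loocv_sd(accuracy), 2))
```

Under this assumption the formula reproduces the reported values (35.71, 31.29, 34.70, and 30.00), which suggests the LOOCV standard deviations reflect the scoring scheme rather than model instability and are not directly comparable with the 5-fold standard deviations.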


Table 4. Results of evaluation metrics from stratified 5-fold cross validation

Algorithm                      Accuracy   SD      Precision   Recall   F1-score
Decision tree                  88.67      11.47   86.02       88.57    88.80
Support Vector Machine (SVM)   90.00      5.58    86.30       94.29    89.92
K-Nearest Neighbors (KNN)      91.33      4.52    88.39       94.29    90.95
Logistic regression            90.00      5.16    87.33       92.86    89.72
Random forest                  92.00      8.06    90.12       92.86    91.36


5. CONCLUSION 

This research combined feature extraction through image analysis
with classification, comparing classifier algorithms with different fundamental characteristics.
In this experiment, a total of 150 mature golden sweet tamarind pieces were used and divided into
two groups by size: 80 pieces of standard size A and 70 pieces of standard size B. The image
shooting area was prepared with a black background. The results showed that the optimal
photographic distance was 60 cm. The image attributes were then extracted in MATLAB
and stored in a database consisting of area, black
pixel, white pixel, perimeter, diameter, and centroid. These data were then divided into a training set and
a test set by the k-fold cross-validation, stratified k-fold, and LOOCV methods, using 5 folds for the
k-fold variants.

The models' performance was then measured by accuracy. Random
forest had the highest accuracy, and stratified 5-fold cross-validation was used
to confirm the model's performance. Based on accuracy, precision, recall, and F1-score,
random forest still had the highest accuracy.

As the next step for this research, the internet of things (IoT) will be used with sensor
devices connected through the internet so that the devices can send data to and receive data from one
another. Data will be collected on the placement of the sweet tamarind, the movement of the tamarind, and the
distance between tamarinds on the conveyor belt under real operating conditions. This will enable the collection
of big data for further machine learning, leading to the development of an automated
quality sorting system for sweet tamarind in Thailand for future industrial applications.